Loading necessary libraries

library(robustbase)
library(psych)
library(MASS)
library(ellipse)
library(here)
library(DescTools)
library(knitr)
library(RobStatTM)

Robust univariate

Explicit robust estimators

Following the definitions outlined during the very first class of the module, explicit robust estimators are readily available in R. There is not much to say here; let us simply recreate the table presented in slide 18 of the Introduction and explicit robust estimators section.

x <-
  c(9.52, 9.68, 10.16, 9.96, 10.08, 9.99, 10.47, 9.91, 9.92, 15.21)

x_9 <- x[-10]

est_names <- c(
  "$\\bar{x}$ ",
  "$\\bar{x}_{0.1}$ ",
  "$\\tilde{x}_{0.1}$",
  "$Me(x)$",
  "$\\hat{\\sigma}$",
  "$MD(x)$",
  "$MAD(x)$",
  "$MADN(x)$",
  "$IQR(x)$"
)

explicit_est_x_9 <- c(mean(x_9),
    mean(x_9, trim = .1),
    psych::winsor.mean(x = x_9, trim = .1),
    median(x_9),
    sd(x_9),
    DescTools::MeanAD(x_9),
    mad(x_9,constant = 1),
    mad(x_9), # MADN
    IQR(x_9))

explicit_est_x <- c(mean(x),
  mean(x, trim = .1),
  psych::winsor.mean(x = x, trim = .1),
  median(x),
  sd(x),
  DescTools::MeanAD(x),
  mad(x,constant = 1),
  mad(x),
  IQR(x)
  )

kable(
  data.frame(est_names, explicit_est_x_9, explicit_est_x),
  digits = c(0, 3, 3),
  escape = FALSE
)
est_names            explicit_est_x_9   explicit_est_x
\(\bar{x}\)                     9.966           10.490
\(\bar{x}_{0.1}\)               9.966           10.021
\(\tilde{x}_{0.1}\)             9.952           10.078
\(Me(x)\)                       9.960            9.975
\(\hat{\sigma}\)                0.272            1.678
\(MD(x)\)                       0.186            0.944
\(MAD(x)\)                      0.120            0.145
\(MADN(x)\)                     0.178            0.215
\(IQR(x)\)                      0.170            0.228
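For intuition, the trimmed and winsorized means can also be hard-coded in base R. The sketch below uses rank-based winsorizing (replacing the k most extreme values in each tail by their nearest retained neighbor), which may differ slightly from psych::winsor.mean, which winsorizes at the trim quantiles:

```r
x <- c(9.52, 9.68, 10.16, 9.96, 10.08, 9.99, 10.47, 9.91, 9.92, 15.21)

# 10% trimmed mean: drop the k = floor(n * trim) smallest and largest values
trimmed_mean <- function(x, trim) {
  n <- length(x)
  k <- floor(n * trim)
  mean(sort(x)[(k + 1):(n - k)])
}

# Rank-based winsorized mean: replace, rather than drop, the k most
# extreme values in each tail by their nearest retained neighbor
winsorized_mean <- function(x, trim) {
  n <- length(x)
  k <- floor(n * trim)
  xs <- sort(x)
  xs[1:k] <- xs[k + 1]
  xs[(n - k + 1):n] <- xs[n - k]
  mean(xs)
}

trimmed_mean(x, 0.1) # matches mean(x, trim = 0.1)
winsorized_mean(x, 0.1)
```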

Implicit robust estimators

Univariate M-estimators can efficiently be created by means of the robustbase R package. It offers plenty of features (we will see some more further on in the lab) for customizing your M-estimator, even letting you define your own psiFunc for constructing psi_func objects. We will not dwell on the details, but if you are curious you can look up the vignette available online. Let us stick with the two we have encountered in class, namely Huber's and Tukey's. They can easily be plotted by means of:

source(system.file("xtraR/plot-psiFun.R", package = "robustbase", mustWork=TRUE))
p.psiFun(seq(-3,3,length.out=100), "huber", par = 1.5, leg.loc="bottomright", main="T")

p.psiFun(seq(-5,5,length.out=100), "biweight", par = 1.5, leg.loc="bottomright",main="T")
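For reference, the two \(\psi\)-functions themselves are short enough to write out by hand. This is a sketch in the standard parametrization; robustbase's internal scaling of these functions may differ:

```r
# Huber's psi: linear in the centre, clipped at +/- k
psi_huber <- function(u, k = 1.5) pmax(-k, pmin(k, u))

# Tukey's biweight psi: smoothly redescends to 0 outside [-k, k]
psi_tukey <- function(u, k = 4.685) u * (1 - (u / k)^2)^2 * (abs(u) <= k)

u <- seq(-6, 6, length.out = 201)
plot(u, psi_huber(u), type = "l", ylab = expression(psi(u)))
lines(u, psi_tukey(u), lty = 2)
legend("topleft", c("Huber", "Tukey"), lty = 1:2)
```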

We have already seen how to hard-code the iterative algorithm described in slide 13, but with an additional (robust) scale estimate \(\hat{\sigma}\).

manual_M_location <-
  function(x,
           k,
           type = c("Huber", "Tukey"),
           tol = sqrt(.Machine$double.eps),
           itermax = 1000) {
    
    
    mu <- mu_old <- median(x) # initial value
    rob_scale <- mad(x) # it will be kept fixed
    crit <- TRUE
    iter <- 0
    
    type <- match.arg(type) # resolve the default c("Huber", "Tukey") to one choice
    weight_f <- switch(type,
                       "Huber" = function(x) pmin(1, k/abs(x)),
                       "Tukey" = function(x) ((1-(x/k)^2)^2)*(abs(x)<=k)
    )
    
    while(crit){
      w_i <- weight_f((x-mu)/rob_scale)
      mu <- weighted.mean(x = x,w = w_i)
      err <- abs(mu-mu_old)
      
      mu_old <- mu
      iter <- iter+1
      
      crit <- (err > tol & iter < itermax)
    }
    list(mu=mu, s=rob_scale,it=iter)
  }

The robustbase package contains the huberM function for performing M-estimation of location with MAD scale by means of the Huber function:

huberM(x = x,k = 1.5)
## $mu
## [1] 10.00333
## 
## $s
## [1] 0.214977
## 
## $it
## [1] 13
## 
## $SE
## [1] NA
manual_M_location(x = x,k = 1.5,type = "Huber")
## $mu
## [1] 10.00333
## 
## $s
## [1] 0.214977
## 
## $it
## [1] 12

In order to employ other \(\psi\)-functions, as well as other estimator types (S, MM, not covered in class), nlrob is a very flexible function for robust fitting. Honestly, I do not find it straightforward to use, so we move directly to robust multivariate estimators.

Robust multivariate

A simple example

We consider a data frame with average brain and body weights for \(62\) species of land mammals and three other animal types.

data("Animals2")
# let us work with the natural logarithms of the weights
Animals2$body <- log(Animals2$body)
Animals2$brain <- log(Animals2$brain)
plot(Animals2)

Does it look familiar? We have already seen a plot of these data during the class. The aim is to derive robust location and scale estimators employing the Minimum Covariance Determinant (MCD). There are several ways of doing it in R, with different routines available. We make use of the robustbase package, which contains the "Essential" Robust Statistics tools for analyzing data with robust methods.

The needed function is covMcd, whose main arguments are:

  • x: a matrix of data points
  • alpha: numeric parameter controlling the size of the subsets over which the determinant is minimized; roughly alpha*N
  • nsamp: number of subsets used for initial estimates, or a character string specifying whether "best", "exact", or "deterministic" estimation should be performed

Clearly, the most important hyper-parameter to be selected is alpha, as it approximately determines the size of the subsets used for parameter estimation, and the consequent BP of the estimator. Let us set alpha=0.75.
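The link between alpha and the breakdown point can be sketched numerically. This is only an approximation: covMcd derives the exact subset size h via its h.alpha.n() helper, which differs slightly from floor(alpha * n) (here it gives h = 49 rather than 48):

```r
# Approximate subset size and breakdown point of the MCD (sketch):
# the determinant is minimized over subsets of roughly h = alpha * n
# observations, so about (n - h) / n contamination can be tolerated
n <- 65 # rows of Animals2
alpha <- 0.75
h_approx <- floor(alpha * n)
bp_approx <- (n - h_approx) / n
c(h = h_approx, BP = bp_approx)
```

Lowering alpha towards 0.5 raises the breakdown point at the cost of efficiency.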

fit_MCD <- covMcd(x = Animals2, alpha = .75, nsamp = "best")
fit_MCD
## Minimum Covariance Determinant (MCD) estimator approximation.
## Method: Fast MCD(alpha=0.75 ==> h=49); nsamp = best; (n,k)mini = (300,5)
## Call:
## covMcd(x = Animals2, alpha = 0.75, nsamp = "best")
## Log(Det.):  0.5867 
## 
## Robust Estimate of Location:
##  body  brain  
## 1.292  3.074  
## Robust Estimate of Covariance:
##          body  brain
## body   10.999  8.164
## brain   8.164  6.529

The fit_MCD object contains the raw and reweighted estimates of location and scatter, the resulting robust Mahalanobis distances and the best subset used for computing the raw estimates.

ind_best_subset <- fit_MCD$best
N <- nrow(Animals2)
p <- ncol(Animals2)
plot(Animals2, col=ifelse(1:N%in%ind_best_subset,"black","red"),pch=19)

Notice that the raw MCD estimate of scatter is by default multiplied by a consistency factor

\[ c(p, \alpha)=\frac{\alpha}{F_{\chi_{p+2}^{2}}\left(q_{p, \alpha}\right)} \] (\(q_{p, \alpha}\) is the \(\alpha\)-level quantile of the \(\chi^2_p\) distribution) and by a finite-sample correction factor, to make it consistent at the normal model and unbiased in small samples.

dplyr::near(fit_MCD$raw.center,colMeans(Animals2[ind_best_subset,]))
##  body brain 
##  TRUE  TRUE
dplyr::near(fit_MCD$raw.cov,cov(Animals2[ind_best_subset,])*prod(fit_MCD$raw.cnp2))
##       body brain
## body  TRUE  TRUE
## brain TRUE  TRUE
h <- fit_MCD$quan
dplyr::near(fit_MCD$raw.cnp2[1],(h/N)/pchisq(qchisq(p = h/N,df = p),df = p+2))
## [1] TRUE

Notice that by default, unless raw.only=TRUE is specified, covMcd performs a reweighting step to improve the efficiency of the final estimator. In detail, the new weights are computed as follows:

\[ W\left(d^{2}\right)=I\left(d^{2} \leqslant \chi_{p, 0.975}^{2}\right) \] where \(d^2\) is the squared Mahalanobis distance based on the scaled raw MCD.

ind_rew_obs <-
  which(
    mahalanobis(
      x = Animals2,
      center = fit_MCD$raw.center,
      cov = fit_MCD$raw.cov
    ) <= qchisq(p = .975, df = p)
  )
dplyr::near(fit_MCD$center,colMeans(Animals2[ind_rew_obs,]))
##  body brain 
##  TRUE  TRUE
dplyr::near(fit_MCD$cov,cov(Animals2[ind_rew_obs,])*prod(fit_MCD$cnp2))
##       body brain
## body  TRUE  TRUE
## brain TRUE  TRUE

robustbase also provides a nice plotting method that lets us explore the full set of graphical tools we have seen in class in an almost automatic way:

plot(fit_MCD,classic=TRUE)

We see that the outliers do not influence the robust estimates; the same can only partially be said of the classical estimators.

Everything could, as always, have been hard-coded:

n <- nrow(Animals2)
sample_mean <- apply(Animals2, 2, mean)
sample_cov <- cov(Animals2)#*(n-1)/n # by default R computes the corrected sample variance

# MCD estimates
plot(Animals2, xlim=c(-10,15),ylim=c(-5,15))
lines(ellipse(x = fit_MCD$cov,centre=fit_MCD$center),lty=2, col="blue",type="l")
points(x=fit_MCD$center[1],y=fit_MCD$center[2], pch="x", col="blue", cex=2)

# ML estimates
lines(ellipse(x = sample_cov, centre=sample_mean),lty=2, col="red",type="l")
points(x=sample_mean[1],y=sample_mean[2], pch="x", col="red", cex=2)

# Robust estimates
plot(sqrt(fit_MCD$mah))
# Classical estimates
plot(sqrt(mahalanobis(x = Animals2,center = sample_mean, cov = sample_cov)))
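A common rule of thumb for reading these distance plots (a sketch, relying on the squared distances being approximately chi-square distributed at the normal model) is to flag points whose distance exceeds the square root of the 0.975 chi-square quantile:

```r
# Cutoff for the distance plots above: squared Mahalanobis distances are
# approximately chi-square with p degrees of freedom at the normal model
# (p = 2 variables in Animals2)
p <- 2
cutoff <- sqrt(qchisq(0.975, df = p))
cutoff
# adding abline(h = cutoff, lty = 2) to a distance plot marks the
# threshold above which observations would be flagged as outliers
```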

A somewhat less simple example

Let us now consider a dataset containing measurements of \(p= 9\) characteristics of \(n = 677\) diaphragm parts used in the production of TV sets. Diaphragms are thin metal plates, molded by a press. The aim of the multivariate analysis is to gain insight into the production process and the interrelations between the nine measurements, and to find out whether deformations or abnormalities have occurred and why.

data_philips <- readRDS(here::here("Block V - Robust statistics/data/data_philips.Rds"))
pairs(data_philips)

There is no clear visible pattern in the data, and no outliers seem to be present.

fit_MCD <- covMcd(x = data_philips, alpha = .75)
fit_MCD
## Minimum Covariance Determinant (MCD) estimator approximation.
## Method: Fast MCD(alpha=0.75 ==> h=510); nsamp = 500; (n,k)mini = (300,5)
## Call:
## covMcd(x = data_philips, alpha = 0.75)
## Log(Det.):  -64.75 
## 
## Robust Estimate of Location:
##       X1        X2        X3        X4        X5        X6        X7        X8  
## -0.04092  -0.31689  -0.05079   0.43806   2.12311   0.43384  -0.10401  -0.06584  
##       X9  
## -0.08844  
## Robust Estimate of Covariance:
##            X1          X2         X3         X4          X5         X6
## X1  0.0161197   0.0065964  0.0058713  0.0027880   0.0022950  1.563e-03
## X2  0.0065964   0.0113972  0.0018364  0.0007830  -0.0004515  6.220e-04
## X3  0.0058713   0.0018364  0.0068021  0.0016147   0.0022204  9.816e-04
## X4  0.0027880   0.0007830  0.0016147  0.0008569   0.0008314  3.597e-04
## X5  0.0022950  -0.0004515  0.0022204  0.0008314   0.0028981  8.686e-04
## X6  0.0015629   0.0006220  0.0009816  0.0003597   0.0008686  5.406e-04
## X7  0.0033217   0.0002781  0.0019223  0.0009564   0.0008422  2.860e-04
## X8  0.0002417  -0.0004847  0.0005638  0.0001670   0.0002286  1.742e-05
## X9  0.0022851   0.0013060  0.0014172  0.0004875   0.0007173  5.007e-04
##            X7          X8         X9
## X1  0.0033217   2.417e-04  2.285e-03
## X2  0.0002781  -4.847e-04  1.306e-03
## X3  0.0019223   5.638e-04  1.417e-03
## X4  0.0009564   1.670e-04  4.875e-04
## X5  0.0008422   2.286e-04  7.173e-04
## X6  0.0002860   1.742e-05  5.007e-04
## X7  0.0013263   3.112e-04  4.385e-04
## X8  0.0003112   4.495e-04  5.354e-05
## X9  0.0004385   5.354e-05  6.273e-04
plot(fit_MCD, classic=TRUE, labels.id=FALSE, which="distance")

plot(fit_MCD,labels.id=FALSE,which=c("dd"))

Robust distances reveal a strongly deviating group of outliers, ranging from index \(491\) to index \(565\). The reason is that the process changed after the first \(100\) points, and between indices \(491\) and \(565\) it was out of control.

Robust regression

A simple example

Let us consider the Hertzsprung-Russell Diagram Data already encountered in the very first lab and consider the problem of regressing the logarithm of light intensity as a function of the logarithm of the effective temperature at the surface of the star.

We have already seen that the OLS estimator goes bananas

plot(starsCYG)
fit_lm <- lm(log.light~log.Te, data=starsCYG)
abline(fit_lm, col="red", lwd=2)
text(starsCYG$log.Te, starsCYG$log.light, 1:nrow(starsCYG), pos=1)

Yet, if we were to look at standard diagnostic plots we would not grasp how bad the situation actually is:

plot(fit_lm)

We would, however, if we fitted the Least Median of Squares (LMS) and the Least Trimmed Squares (LTS) estimators:

fit_lms <- lmsreg(log.light~log.Te, data=starsCYG)

# This is better than the previous
fit_lts <- ltsReg(log.light~log.Te, alpha=.75,mcd=TRUE,data=starsCYG) #ltsreg in the MASS package uses an older (slower) implementation of the LTS estimator
plot(starsCYG)
abline(fit_lm, col="red", lwd=2)
abline(fit_lms, col="darkblue", lwd=2)
abline(fit_lts, col="darkgreen", lwd=2)
legend("bottomleft", c('OLS', 'LMS', 'LTS'), lwd=rep(2,3), col=c("red", "darkblue", "darkgreen"))
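The LTS criterion being minimized can also be written out directly. This is an illustrative sketch of the objective only; ltsReg uses the FAST-LTS algorithm to search over subsets rather than a naive optimizer:

```r
# LTS objective (sketch): sum the h smallest squared residuals,
# discarding the n - h largest ones
lts_objective <- function(beta, X, y, h) {
  r2 <- as.vector((y - X %*% beta)^2)
  sum(sort(r2)[seq_len(h)])
}

# Tiny check on clean data: the true coefficients give a zero objective
X <- cbind(1, 1:10)
y <- X %*% c(2, 3)
lts_objective(c(2, 3), X, y, h = 8)
```

Because the n - h largest squared residuals are simply discarded, up to n - h wild observations leave the objective, and hence the fit, unaffected.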

As for the MCD estimator, robustbase provides plotting methods for LTS

plot(fit_lts)

A somewhat less simple example

We consider \(40\) cases of a study on production waste and land use, originally published in Golueke, C.G. and McGauhey, P.H. (1970), Comprehensive Studies of Solid Waste Management. The response variable is solid waste (millions of tons), while the remaining explanatory variables are industrial land (acres), fabricated metals (acres), trucking and wholesale trade (acres), retail trade (acres) and restaurants and hotels (acres).

data("waste")

fit_lts <- ltsReg(SolidWaste~., alpha=.75,mcd=TRUE,data=waste) 
summary(fit_lts)
## 
## Call:
## ltsReg.formula(formula = SolidWaste ~ ., data = waste, alpha = 0.75, 
##     mcd = TRUE)
## 
## Residuals (from reweighted LS):
##      Min       1Q   Median       3Q      Max 
## -0.17418 -0.04812  0.00000  0.02386  0.20312 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## Intercept    1.335e-01  1.862e-02   7.171 6.81e-08 ***
## Land        -2.734e-05  1.099e-05  -2.488 0.018850 *  
## Metals       1.483e-04  9.418e-05   1.574 0.126290    
## Trucking     5.132e-04  6.728e-05   7.627 2.08e-08 ***
## Retail      -7.077e-04  3.915e-04  -1.808 0.080995 .  
## Restaurants  8.031e-03  2.169e-03   3.702 0.000893 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08231 on 29 degrees of freedom
## Multiple R-Squared: 0.952,   Adjusted R-squared: 0.9438 
## F-statistic: 115.1 on 5 and 29 DF,  p-value: < 2.2e-16
plot(fit_lts)